Toward a Test Theory for Assessing 
Student Understanding 



Robert J. Mislevy, Kentaro Yamamoto, and Steven Anacker 
Educational Testing Service 



Abstract 

The view of learning that underlies standard test theory is 
inconsistent with the view rapidly emerging from cognitive and educational 
psychology. Learners become more competent not simply by learning more 
facts and skills, but by reconfiguring their knowledge; by “chunking” 
information to reduce memory loads; and by developing strategies and 
models that help them discern when and how facts and skills are important. 
Neither classical test theory nor item response theory (IRT) is designed to 
inform educational decisions conceived from this perspective. This paper 
sketches the outlines of a test theory built around models of student 
understanding, as inspired by the substance and the psychology of the 
domain of interest. The ideas are illustrated with a simple numerical 
example based on Siegler’s balance beam tasks. Directions in which the 
approach must be developed to be broadly useful in educational practice are 
discussed. 

Background 



When schooling became mandatory at the turn of the century, educators suddenly 
faced selection and placement decisions for unprecedented numbers of students, displaying 
the diversity of abilities and backgrounds that individuals bring to schooling (Glaser, 
1981). Numbers of correct answers to multiple-choice test items were used to rank 
students according to their overall proficiencies in domains of tasks. These rankings were 
used in turn to predict students’ success in fixed educational experiences. 

Classical test theory (CTT) emerged when Spearman (e.g., 1907) applied statistical 
methods to study how reliable estimates of this overall proficiency would be from different 
test forms that might be constructed for the purpose. Extensions of this work led over the 
years to a vast armamentarium of techniques for building tests and making decisions with 
test scores (Gulliksen, 1950); to an axiomatic foundation for statistical inference about test 
scores (Lord, 1959; Lord & Novick, 1968; Novick, 1966); and to sophisticated techniques 
for partitioning test score variance according to facets of items, persons, and observational 
settings (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). It is important to note that in all 
this work, the object of inference is overall proficiency — the test score, observed or 
expected — in terms of numbers of correct responses in a domain of items. 

Item response theory (IRT; see Hambleton, 1989, for an overview) represented a 
major practical advance over CTT by modeling probabilities of correct item response in 
terms of an unobservable proficiency variable. IRT solves many problems that were 
difficult under CTT, in equating, test construction, and adaptive testing. Advanced 
statistical methods have been brought to bear on inferential problems in IRT, including 
sophisticated estimation algorithms (e.g., Bock & Aitkin, 1981), techniques from missing- 
data theory (Mislevy, in press-a), and Bayesian treatments of uncertainty in models and 
parameters (Lewis, 1985; Mislevy & Sheehan, 1990; Tsutakawa & Johnson, 1988). The 
underlying psychological model remains quite simple, however; as in CTT, the focus 
remains on overall proficiency in a domain of items. From the perspective of IRT, two 
students with the same overall proficiency are indistinguishable. 

As useful as standard tests and standard test theory have proven in large-scale 
evaluation, selection, and placement problems, their focus on who is competent and how 
many items they answer can fall short when the goal is to improve individuals’ 
competencies. Glaser, Lesgold, and Lajoie (1987) point out that tests can predict failure 
without an understanding of what causes success, but intervening to prevent failure and 
enhance competence requires deeper understanding. 

The past decade has witnessed considerable progress toward the requisite 
understanding. Psychological research has moved away from the traditional laboratory 
studies of simple (even random!) tasks, to tasks that better approximate the meaningful 
learning and problem-solving activities that engage people in real life. Studies comparing 
the ways experts differ from novices in applied problem-solving in domains such as 
physics and trouble-shooting (e.g., Chi, Feltovich & Glaser, 1981) reveal the central 
importance of knowledge structures — networks of concepts and interconnections among 
them — that impart meaning to patterns in what one observes and how one chooses to act. 
The process of learning is to a large degree expanding these structures and, importantly, 
reconfiguring them to incorporate new and qualitatively different connections as the level of 
understanding deepens. Educational psychologists have begun to put these findings to 
work in designing both instruction and tests (e.g., Glaser et al., 1987; Greeno, 1976; 
Marshall, 1985, in press). Again in the words of Glaser, Lesgold, and Lajoie (1987), 

“Achievement testing as we have defined it is a method of indexing stages of 
competence through indicators of the level of development of knowledge, 
skill, and cognitive process. These indicators display stages of performance 
that have been attained and on which further learning can proceed. They 
also show forms of error and misconceptions in knowledge that result in 
inefficient and incomplete knowledge and skill, and that need instructional 
attention.” (p.81) 

Paraphrasing Ohlsson and Langley (1985), Clancey (1986) summarizes the shift in 
perspective: “[to] describing mental processes, rather than quantifying performance with 
respect to stimulus variables; describing individuals in detail, not just stating generalities; 
and giving psychological interpretation to qualitative data, rather than statistical treatment to 
numerical measurements” (p. 391). 

An Approach to Modeling Student Understanding 

The modeling approach we are beginning to pursue can be encapsulated as follows: 

“Standard test theory evolved as the application of statistical theory with a 
simple model of ability that suited the decision-making environment of mass 
educational systems. Broader educational options, based on insights into 
the nature of learning and supported by more powerful technologies, 
demand a broader range of models of capabilities — still simple compared to 
the realities of cognition, but capturing patterns that inform a broader range 
of instructional alternatives. A new test theory can be brought about by 
applying to well-chosen cognitive models the same general principles of 
statistical inference that led to standard test theory when applied to the 
simple model.” (Mislevy, in press-b). 

The approach begins in a specific application by defining a universe of student 
models. This “supermodel” is indexed by parameters that signify distinctions between 
states of understanding. Symbolically, we shall refer to the (typically vector-valued) 
parameter of the student-model as η. A particular set of values of η specifies a particular 
student model, or one particular state among the universe of possible states the supermodel 
can accommodate. These parameters can be qualitative or quantitative, and qualitative 
parameters can be unordered, partially ordered, or completely ordered. A supermodel can 
contain any mixture of these types. Their nature is derived from the structure and the 
psychology of the learning area, the idea being to capture the essential distinctions among 
students. 

Any application faces a modeling problem, an item construction problem, and an 
inference problem. 

The modeling problem is delineating the states or levels of understanding in a 
learning domain. In meaningful applications this might address several distinct strands of 
learning, as understanding develops in a number of key concepts, and it might address the 
connectivity among those concepts.¹ Symbolically, this substep defines the structure of 
p(x|η), where x represents observations. Obviously any model will be a gross 
simplification of the reality of cognition. A first consideration in what to include in the 
supermodel is the substance and the psychology of the domain: Just what are the key 
concepts? What are important ways of understanding and misunderstanding them? What 
are typical paths to competence? A second consideration is the so-called grain-size 
problem, or the level of detail at which student-models should differ. A major factor in 
answering this question is the decision-making framework under which the modeling will 
take place. As Greeno (1976) points out, “It may not be critical to distinguish between 
models differing in processing details if the details lack important implications for quality of 
student performance in instructional situations, or the ability of students to progress to 
further stages of knowledge and understanding.” 

¹ A particularly interesting special case occurs when the universe of student models can be expressed as 
performance models (Clancey, 1986). A performance model consists of a knowledge base and manipulation 
rules that can be run on problems in a domain of interest. A particular model can contain both knowledge 
and production rules that are incorrect or incomplete; the solutions it produces will be correct or incorrect in 
identifiable ways. Here the parameter η specifies features of performance models. 

The item construction problem is devising situations for which students who differ 
in the parameter space are likely to behave in observably different ways. The conditional 
probabilities of behavior of different types given the unobservable state of the student are 
the values of p(x|η), which may in turn be modeled in terms of another set of parameters, 
say β. The p(x|η) values provide the basis for inferring back about the student state. An 
element in x could contain a right or wrong answer to a multiple-choice test item, but it 
could instead be the problem-solving approach regardless of whether the answer is right or 
wrong, the quickness of responding, a characteristic of a think-aloud protocol, or an 
expert’s evaluation of a particular aspect of the performance. The effectiveness of an item 
is reflected in differences in conditional probabilities associated with different parameter 
configurations, so an item may be very useful in distinguishing among some aspects of 
potential student models but useless for distinguishing among others. Tatsuoka (1989) 
demonstrates the relationship between item construction and inference about students’ 
strategies for subtracting mixed numbers. 

The inference problem is reasoning from observations to student models. The 
model-building and item construction steps provide η and p(x|η). Let p(η) represent 
expectations about η in a population of interest, possibly non-informative, possibly based 
on expert opinion or previous analyses. Bayes’ theorem can be employed to draw 
inferences about η given x via p(η|x) ∝ p(x|η) p(η). Thus p(η|x) characterizes belief 
about a particular student’s model after having observed a sample of the student’s behavior. 
Practical problems include characterizing what is known about β so as to determine p(x|η), 
carrying out the computations involved in determining p(η|x), and, in some applications, 
developing strategies for efficient sequential gathering of observations. As we have noted, 
analogous problems have been studied in standard test theory, and the solutions there, 
because they are applications of general principles of statistical inference, generalize to 
models built around alternative psychological models. The models are more realistic and 
more ambitious, but the formalism is identical.² 
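
The inferential step can be illustrated with a minimal Python sketch for a student model with a finite number of states. The two states, their prior probabilities, and the conditional probabilities of a correct response are hypothetical values chosen only for illustration, and the function name `posterior` is likewise an illustrative choice.

```python
def posterior(prior, likelihood):
    """Return p(eta | x) from p(eta) and p(x | eta) over discrete states.

    Both arguments are dicts keyed by state; `likelihood[state]` is the
    probability of the observed response given that state.
    """
    unnormalized = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every state")
    return {state: value / total for state, value in unnormalized.items()}


# Hypothetical two-state student model (values are assumptions).
prior = {"novice": 0.5, "expert": 0.5}
# Hypothetical p(correct | state) for one observed correct response.
likelihood_correct = {"novice": 0.2, "expert": 0.8}

post = posterior(prior, likelihood_correct)
for state, probability in post.items():
    print(f"p({state} | correct) = {probability:.3f}")
```

The same function applies unchanged when η has more states or when the observation is something other than a right or wrong answer; only the table of conditional probabilities changes.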

Previous Research 

Research relevant to this approach has been carried out in a wide variety of fields, 
including cognitive psychology, the psychology of mathematics and science education, 
artificial intelligence (AI) work on student modeling, test theory, and statistical inference. 
Cognitive scientists have suggested general structures such as “frames” or “schemas” that 
can serve as a basis for modeling understanding (e.g., Minsky, 1975; Rumelhart, 1980), 
and have begun to devise tasks that probe their features (e.g., Marshall, 1989, in press). 
Researchers interested in the psychology of learning in subject areas such as proportional 
reasoning have focused on identifying key concepts, studying how they are typically 
acquired (e.g., in mechanics, Clement, 1982; in ratio and proportional reasoning, Karplus, 
Pulos, & Stage, 1983), and constructing observational settings that allow one to infer 
students’ understanding (e.g., van den Heuvel, 1990; McDermott, 1984). We make no 
effort here to review these literatures, but point out that our work can succeed only by 
building upon their foundations. Our potential contribution would be to the structures and 
mechanics of model-building and inference. The following sections briefly mention some 
important work along these lines from test theory and statistics. 

Modeling Student Behavior 

The standard models of educational measurement are concerned solely with 
examinees’ tendencies to answer items correctly — that is, their overall proficiency. 
Recently, however, models that focus on patterns other than overall proficiency have begun 
to appear in the test theory literature. Some examples that are relevant to educational 
applications are listed below. 



² Advocates of student modeling emphasize the qualitative aspects of student models. Our approach is 
compatible with this view, as it is possible to build universes of qualitative models, indexed by parameters 
that distinguish their features. Our knowledge about a particular student's model is imperfect, however. It 
can be expressed in terms of probabilities expressing the plausibility of various models, given what has 
been observed. Probabilities are quantitative, and admit to a calculus of manipulation. We might thus 
employ a quantitative model for our (imperfect) knowledge about qualitative student models. 



1. Mislevy and Verhelst’s (1990) mixture models for item responses when different 
examinees follow different solution strategies or use alternative mental models. When a 
single IRT model cannot capture key distinctions among examinees, it may suffice to posit 
qualitatively distinct classes of examinees and use IRT models to summarize distinctions 
among examinees within these classes. 

2. Wilson’s (1989b) Saltus model for characterizing stages of conceptual development. 
This model parameterizes the differential patterns of strength and weakness expected as 
learners progress through successive conceptualizations of a domain. 

3. Falmagne’s (1989) and Haertel’s (1984) latent class models for Binary Skills. 
These models are intended for domains in which competence can be described by the 
presence or absence of several (possibly complex) elements of skill or knowledge, and 
observational situations can be devised that demand various combinations of these skills. 
Also see Paulson (1986) for an alternative use of latent class modelling in cognitive 
assessment. 

4. Embretson’s (1985) multicomponent models for integrating item construction and 
inference within a unified cognitive model. The conditional probabilities of solution steps 
given a multifaceted student model are given by IRT-like statistical structures. 

5. Tatsuoka’s (1989) rule space analysis. Tatsuoka uses a generalization of IRT 
methodology to define a metric for classifying examinees based on likely patterns of item 
response given patterns of knowledge and strategies. 

6. Yamamoto’s (1987) HYBRID model for dichotomous responses. The HYBRID model 
characterizes an examinee as either belonging to one of a number of classes associated with 
states of understanding, or in a catch-all IRT class. This approach might be useful when 
certain response patterns signal states of understanding for which particular educational 
experiences are known to be effective. Instructional decisions are triggered by these 
patterns if they are detected, but by overall proficiency when no more targeted action can be 
provided. 

7. Masters and Mislevy’s (in press) and Wilson’s (1989a) use of the Partial Credit 
rating scale model to characterize levels of understanding, as evidenced by the nature or 
approach of a performance rather than its correctness. These applications incorporate into a 
probabilistic framework the cognitive perspective underlying Biggs and Collis’s (1982) 
SOLO taxonomy for describing salient qualities of performances. 

These are the rudiments of models upon which concept-referenced achievement 
measures can be based. Applications to date have been fairly limited, and most have 
addressed one-to-many relationships between an underlying knowledge state and 
observable behavior. That is, a single (possibly unordered or multifaceted) variable has 
been used to characterize examinees, and performance on all items is modeled in terms of 
this variable. What is missing from the point of view of the educator is recognition that 
meaningful real world tasks are rarely segregated into these neat little sets. Rather, they 
often involve multiple concepts, connections among larger concepts, and transformations 
among alternative representations of a domain. While the simple tasks that characterize 
one-to-many domains are essential at early stages of learning, more complex tasks that 
involve multiple concepts in many-to-many relationships are needed to promote the 
integration among concepts that form the core of what is often called “higher-level 
learning.” 

Inference Networks 

Recent developments in the context of probability-based inference networks 
(Lauritzen & Spiegelhalter, 1988; Pearl, 1988) offer a capability for integrating conceptual 
models of the type described above. These probability-based structures are attractive for 
educational measurement because they permit a coherent extension of the modeling 
approach and inferential logic of the new cognitive-assessment models mentioned above. 
To show how the approach might be applied in the educational setting, we first discuss an 
application in the setting of medical diagnosis. 

MUNIN is an inference network that organizes knowledge in the domain of 
electromyography — the relationships among nerves and muscles. Its function is to 
diagnose nerve/muscle disease states. The interested reader is referred to Andreassen, 
Woldbye, Falck, and Andersen (1987) for a fuller description. The prototype discussed in 
that presentation and used for our illustration concerns a single arm muscle, with concepts 
represented by twenty-five nodes and their interactions represented by causal links.³ A 
graphic representation of the network appears in Figure 1. 

[Figure 1 about here] 

The rightmost column of nodes in Figure 1 concerns outcomes of potentially 
observable variables, such as symptoms or test results. These outcomes are the x vector in 
our earlier notation. The middle layers are “pathophysiological states,” or syndromes. 
These drive the probabilities of observations. The leftmost layer is the underlying disease 
state, including three possible diseases in various stages, no disease, or “Other” — a 
condition not built into the system. These states drive the probabilities of syndromes. It 
is assumed that a patient’s true state can be adequately characterized by values of these 
disease and syndrome states, which play the role of our η parameter. Paths indicate conditional probability 
relationships, which are to be determined either logically, subjectively, purely empirically, 
or through model-based statistical estimation. In particular, the paths ending at observables 
represent p(x|η). Note that the probabilities of observables depend on some syndromes, 
but not others. The lack of a path signifies conditional independence. Note also that a 
given test result can be caused by different disease combinations. 

As a patient enters the clinic, the diagnostician’s state of knowledge about him is 
expressed by population base rates, or p(η). This is depicted in Figure 1 by bars that 
represent the base probabilities of disease and syndrome states. Base rates of observable 
test results are similarly shown. Tests are carried out, one at a time or in clusters, and with 
each result the probabilities of disease states are updated. The expectations of tests not yet 
given are calculated, and it can be determined which test will be most informative in 
identifying the disease state. Knowledge is thus accumulated in stages, from p(η) to 
p(η|x₁) after observing the first subset of tests, to p(η|x₁,x₂) after the second, and so on, 
with each successive test selected optimally in light of knowledge at that point in time. 
Figure 2 illustrates the state of knowledge after a number of electromyographic test results 
have been observed. Observable nodes with results now known are depicted with shaded 
bars representing observed values. For them, knowledge is perfect. The implications of 
these results have been propagated leftward to syndromes and disease states, as shown by 
distributions that differ from the base rates in Figure 1. These values guide the decision to 
test further or initiate a treatment. Finally, updated beliefs about disease states have been 
propagated back toward the right to update expectations about the likely outcomes of tests 
not yet administered. These expectations, and the potential they hold for further updating 
knowledge about the disease states, guide the selection of further tests. 

³ The ESPRIT team has generalized the application to address clusters of interrelated muscles in a network 
containing over a thousand nodes. 
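
The cycle of updating beliefs and then choosing the next test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three disease states, their base rates, and the two binary tests are hypothetical, and expected posterior entropy is used as the selection criterion because it is a common choice, not because the source says MUNIN uses it.

```python
import math


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def entropy(dist):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def update(belief, p_positive, outcome):
    """Posterior over states after a binary test result (True = positive)."""
    likelihood = [p if outcome else 1 - p for p in p_positive]
    return normalize([b * l for b, l in zip(belief, likelihood)])


def expected_entropy(belief, p_positive):
    """Average posterior entropy, weighted by the predicted result probabilities."""
    p_pos = sum(b * p for b, p in zip(belief, p_positive))
    result = 0.0
    for outcome, p_outcome in ((True, p_pos), (False, 1 - p_pos)):
        if p_outcome > 0:
            result += p_outcome * entropy(update(belief, p_positive, outcome))
    return result


def most_informative(belief, tests):
    """Name of the test with the lowest expected posterior entropy."""
    return min(tests, key=lambda name: expected_entropy(belief, tests[name]))


# Hypothetical base rates for: no disease, disease A, disease B.
belief = [0.7, 0.2, 0.1]
# Hypothetical p(positive result | state) for each candidate test.
tests = {
    "test_1": [0.1, 0.9, 0.9],  # separates healthy from diseased
    "test_2": [0.5, 0.5, 0.5],  # carries no information about the state
}

choice = most_informative(belief, tests)
belief_after = update(belief, tests[choice], True)
print(choice, [round(p, 3) for p in belief_after])
```

Repeating the two calls, with the posterior from one step serving as the belief for the next, gives the staged accumulation of knowledge described above.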

[Figure 2 about here] 

Inference Networks in the Educational Setting 

To see how the ideas underlying MUNIN apply to the educational setting, consider 
the following analogy: 



Medical Application                     Educational Application

Observable symptoms, medical tests      Test items, verbal protocols, teachers’
                                        ratings of levels of understanding,
                                        solution traces

Disease states, syndromes               States or levels of understanding of
                                        key concepts, available strategies

Architecture of interconnections        Architecture of interconnections
based on medical theory                 based on cognitive and educational
                                        theory

Conditional probabilities given by      Conditional probabilities given by
physiological models, empirical data,   psychological models, empirical data,
expert opinion                          expert opinion



The definitions of key concepts will be guided by theorized and observed stages of 
learning in the area, and the connections with observables will be expressed through 
measurement models such as those discussed above. The initialization of the probabilities 
in the network will be accomplished by one or more methods: clinical analysis, with skilled 
interviewers assessing in detail the nature of students’ understandings and relating these 
understandings to task performances; statistical analysis of data concerning selected models 
for portions of the larger network (Mislevy & Verhelst, 1990); or theoretical analysis, in 
which logic or theory provides expectations for outcomes under hypothesized cognitive 
states. After the initialization phase, connections can be updated periodically with the larger 
amounts of less precise data that will be accumulated as students provide information about 
the adequacy of the relationships embodied in the network and the accuracy of the baseline 
and conditional probabilities. 



A Numerical Example 

Siegler’s balance beam tasks 

Kuhn (1970) emphasizes the central role that exemplars, or small, archetypical 
examples, play in science. Textbook examples are the vehicle through which students are 
acculturated to the concepts and relationships of a particular way of viewing a class of 
phenomena — a paradigm, in Kuhn’s words. They function almost like parables or 
morality tales. New paradigms are introduced with new exemplars that introduce new 
concepts, highlight differences between the new paradigm and the old, and demonstrate 
how the new way of thinking solves problems the old way could not. Modeling the states 
of the electron in the hydrogen atom possesses this status in quantum mechanics. 

Explaining children’s understanding of balance beam problems, an exemplar from 
developmental psychology originated by Piaget, is approaching the same status in test 
theory (e.g., Kempf, 1983; Mislevy, in press-b; Wilson, 1989b). Robert Siegler’s 
balance beam tasks yield data that are, on the surface, indistinguishable from standard test 
data, but there are two key distinctions: 

1. What is important about examinees is not their overall probability of answering 
items correctly, but their (unobservable) state of understanding of the domain. 

2. Children at less sophisticated levels of understanding initially get certain problems 
right for the wrong reasons. These items are more likely to be answered wrong at 
intermediate stages, as understanding deepens! They are bad items by the standards 
of classical test theory and IRT, because probabilities of correct response do not 
increase monotonically with increasing total test score. From the perspective of the 
developmental theory, however, not only is this reversal expected, but it plays an 
important role in distinguishing among children with different ways of thinking 
about the problems. 

Attempting to study children’s reasoning in a manner less subjective than Piaget’s 
unstructured interviews, Siegler (1981) devised a series of balance beam tasks like the one 
illustrated in Figure 3. Varying numbers of weights are placed at varying locations on a 
balance beam. The child predicts whether the beam will tip to the left, to the right, or remain in 
balance. Piaget’s analysis of children’s behavior on balancing tasks (Inhelder & Piaget, 
1958) posits that a child will respond in accordance with his or her stage of understanding. 
The usual stages through which children progress can be described in terms of successive 
acquisition of the rules listed below. 

[Figure 3 about here] 

Rule I: If the weights on both sides are equal, it will balance. If they are not equal, the 
side with the heavier weight will go down. (Weight is the “dominant dimension,” 
because children are generally aware that weight is important in the problem earlier 
than they realize that distance from the fulcrum, the “subordinate dimension,” also 
matters.) 

Rule II: If the weights and distances on both sides are equal, then the beam will balance. 
If the weights are equal but the distances are not, the side with the longer distance 
will go down. Otherwise, the side with the heavier weight will go down. (A child 
using this rule uses the subordinate dimension only when information from the 
dominant dimension is equivocal.) 

Rule III: Same as Rule II, except that if the values of both weight and length are unequal 
on both sides, the child will “muddle through” (Siegler, 1981, p.6). (A child using 
this rule now knows that both dimensions matter, but doesn’t know just how they 
combine. Responses will be based on a strategy such as guessing.) 

Rule IV: Combine weights and lengths correctly (i.e., compare torques, or products of 
weights and distances). 
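
The four rules can be written as small prediction functions, which makes it easy to see where they agree and disagree on a given problem. The sketch below follows the wording of the rules above literally; the function names, the argument order (left weight, left distance, right weight, right distance), and the use of a uniform random guess to stand in for “muddling through” are assumptions made for illustration.

```python
import random

SIDES = ("left", "balance", "right")


def _compare(left, right):
    """Return the side with the larger value, or 'balance' if they are equal."""
    if left == right:
        return "balance"
    return "left" if left > right else "right"


def rule_i(lw, ld, rw, rd):
    # Only weight, the dominant dimension, is considered.
    return _compare(lw, rw)


def rule_ii(lw, ld, rw, rd):
    # Distance is consulted only when the weights are equal.
    if lw != rw:
        return _compare(lw, rw)
    return _compare(ld, rd)


def rule_iii(lw, ld, rw, rd, rng=random):
    # When both weight and distance differ, the child "muddles through";
    # a uniform random guess is an assumed stand-in for that behavior.
    if lw != rw and ld != rd:
        return rng.choice(SIDES)
    return rule_ii(lw, ld, rw, rd)


def rule_iv(lw, ld, rw, rd):
    # Compare torques: weight times distance on each side.
    return _compare(lw * ld, rw * rd)


# Hypothetical problem: 3 weights at distance 1 on the left,
# 1 weight at distance 4 on the right.
problem = (3, 1, 1, 4)
predictions = {
    "Rule I": rule_i(*problem),
    "Rule II": rule_ii(*problem),
    "Rule IV": rule_iv(*problem),
}
print(predictions)
```

On this problem Rules I and II both predict that the heavier left side goes down, while Rule IV compares torques of 3 and 4 and predicts the right side.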

It was thus hypothesized that each child could be classified into one of five stages — 
the four characterized by the rules, or an earlier “preoperational” stage in which neither 
weight nor length is thought to bear any systematic relationship to the action of the beam. 

Siegler developed six types of problems listed below to distinguish among children 
at different stages of reasoning. (See Figure 4 for an example of each.) 

Equal problems (E), with matching weights and lengths on both sides. 

Dominant problems (D), with unequal weights but equal lengths. 

Subordinate problems (S), with unequal lengths but equal weights. 

Conflict-dominant problems (CD), in which one side has greater weight, the other has 
greater length, and the side with the heavier weight will go down. 

Conflict-subordinate problems (CS), in which one side has greater weight, the other has 
greater length, and the side with the greater length will go down. 

Conflict-equal problems (CE), in which one side has greater weight, the other has greater 
length, and the beam will balance. 
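
The six definitions translate directly into a classifier for any configuration of weights and distances. The following is a sketch under the same assumed argument order as before (left weight, left distance, right weight, right distance); the function name and the choice to return `None` for configurations outside the six types are illustrative.

```python
def problem_type(lw, ld, rw, rd):
    """Classify a balance beam problem as E, D, S, CD, CS, or CE.

    Returns None when one side has both the greater weight and the greater
    distance, since that configuration is not among the six types listed.
    """
    if lw == rw and ld == rd:
        return "E"
    if ld == rd:
        return "D"  # unequal weights, equal distances
    if lw == rw:
        return "S"  # equal weights, unequal distances
    heavier_is_left = lw > rw
    farther_is_left = ld > rd
    if heavier_is_left == farther_is_left:
        return None  # no conflict between the two dimensions
    left_torque, right_torque = lw * ld, rw * rd
    if left_torque == right_torque:
        return "CE"
    down_is_left = left_torque > right_torque
    return "CD" if down_is_left == heavier_is_left else "CS"


# Hypothetical configurations, one per problem type.
examples = {
    "E": (2, 3, 2, 3),
    "D": (3, 2, 1, 2),
    "S": (2, 1, 2, 3),
    "CD": (3, 2, 1, 3),
    "CS": (3, 1, 1, 4),
    "CE": (2, 1, 1, 2),
}
for label, config in examples.items():
    print(label, problem_type(*config))
```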

[Figure 4 about here] 

Table 1 shows the probabilities of correct response that would be expected from 
groups of children in different stages, if their responses were in complete accordance with the 
hypothesized rules. Scanning across the rows reveals how the probability of a correct 
response to a given type of item does not always increase as level of understanding 
increases. For example, Stage II children tend to answer CD items right for the wrong 
reason, while Stage III children, now aware of a conflict, flounder. 

[Table 1 about here] 

A latent class model for balance beam tasks 

If the theory were perfect, the columns in Table 1 would give probabilities of 
correct response to the various types of items from children at different stages of 
understanding. Observing a correct response to an S item, for example, would eliminate 
the possibility that the child was in Stage I. But because the model is not perfect,⁴ and 
because children make slips and lucky guesses, any response could be observed from a 
child in any stage. A latent class model (Lazarsfeld, 1950) can be used to express the 
structure posited in Table 1 while allowing for some “noise” in real data (see Appendix for 
details). Instead of expecting incorrect responses with probability one to S items from 
Stage I children, we might posit some small fraction of correct answers, p(S correct | 
Stage=I). Similar probabilities of “false positives” can be estimated for other cells in Table 
1 containing 0’s. In the same spirit, probabilities less than one, due to “false negatives,” 
can be estimated for the cells with 1’s. Note that inferences cannot be as strong when these 
uncertainties are present; a correct response to an S item still suggests that a child is 
probably not in Stage I, but no longer is it proof positive. 

⁴ This model assumes that the five states are exhaustive and mutually exclusive. Alternative models, such 
as those of Tatsuoka and Yamamoto mentioned earlier, could be used to relax these restrictions. 
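
The weakening of inference under noise can be seen in a small numerical sketch that contrasts only Stage I with Stage II. The equal prior and the noisy probabilities of .05 and .90 are hypothetical values chosen for illustration; they are not the estimates reported in Table 2.

```python
def stage_posterior(prior, p_correct, correct=True):
    """Posterior over stages after one response, given p(correct | stage)."""
    likelihood = [p if correct else 1 - p for p in p_correct]
    joint = [a * b for a, b in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


stages = ["I", "II"]
prior = [0.5, 0.5]  # hypothetical equal prior over the two stages

# Noiseless case: a Stage I child never answers an S item correctly,
# and a Stage II child always does.
ideal_post = stage_posterior(prior, [0.0, 1.0])

# Noisy case with assumed rates: .05 false positives for Stage I,
# .10 false negatives for Stage II.
noisy_post = stage_posterior(prior, [0.05, 0.90])

print("ideal:", dict(zip(stages, ideal_post)))
print("noisy:", {s: round(p, 3) for s, p in zip(stages, noisy_post)})
```

Under the noiseless table a correct response rules out Stage I entirely; under the assumed error rates it leaves Stage I with a small but nonzero posterior probability.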

Expressing this model in the notation introduced above, η represents stage 
membership, x represents item responses, and p(x|η) are conditional probabilities of 
correct responses to items of the various types from children in different stages, that is, a noisy 
version of Table 1. The proportions of children in a population of interest at the different 
stages are p(η), and the probabilities that convey our knowledge about a child’s stage after 
we have observed his responses are p(η|x). 

Siegler created a 24-task test composed of four tasks of each type. He collected 
data from 60 children, from age 3 up through college age, at two points in time, for a total 
of 120 response vectors. We fit a latent class model to these data using the HYBRIL 
computer program (Yamamoto, 1987), obtaining the conditional probabilities p(x|η) 
shown in Table 2, and the following vector summarizing the (estimated) population 
distribution of stage membership: 

p(η) = (Prob(Stage=0), Prob(Stage=I), ..., Prob(Stage=IV)) 
     = (.257, .227, .163, .275, .078). 

[Table 2 about here] 

Note that different types of items are differentially useful to distinguish among 
children at different levels. E items, for example, are best for distinguishing Stage 0 
children from everyone else. CD items, which would be dropped from standard tests 
because their probabilities of correct response do not have a strictly increasing relationship 
with total scores, help differentiate among children at Stages II, III, and IV. 

Figure 5 depicts the state of knowledge about a child before observing any 
responses, using the conventions of the MUNIN figures. For simplicity, just one item of 
each type is shown rather than all four. The corresponding status of an observable node 
(i.e., an item type) is the expectation of a correct response from a child selected at random 
from the population. The path from the stage-membership node to a particular observable 
node represents a row of Table 2. 
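
The initial status of each observable node can be sketched numerically. The following is a minimal illustration in Python, not the computation used to draw Figure 5: it takes the entries of Table 2 and the estimated population proportions reported above, and averages each row over the stage distribution. The names `TABLE2`, `PRIOR`, and `expected_correct` are introduced here for illustration only.

```python
# Estimated conditional probabilities of a correct response (Table 2).
# One row per item type; columns are Stages 0, I, II, III, IV.
TABLE2 = {
    "E":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "D":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "S":  [0.333, 0.026, 0.883, 0.981, 0.943],
    "CD": [0.333, 0.973, 0.883, 0.333, 0.943],
    "CS": [0.333, 0.026, 0.116, 0.333, 0.943],
    "CE": [0.333, 0.026, 0.116, 0.333, 0.943],
}

# Estimated population proportions for Stages 0 through IV.
PRIOR = [0.257, 0.227, 0.163, 0.275, 0.078]


def expected_correct(row, stage_probs):
    """Probability of a correct response, averaged over stage membership."""
    return sum(p * w for p, w in zip(row, stage_probs))


# Expected proportion correct for a child drawn at random from the population.
initial_expectations = {
    item_type: expected_correct(row, PRIOR) for item_type, row in TABLE2.items()
}

for item_type, value in initial_expectations.items():
    print(f"{item_type:>2}: {value:.3f}")
```

Under these values, E and D items carry the highest initial expectation of a correct response and CS and CE items the lowest, which matches the ordering one would read off Table 2.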



[Figure 5 about here] 



Adaptive testing 

Figure 5 represents the state of our knowledge about a child’s reasoning stage and 
expected responses before any actual responses are observed. How does knowledge 
change when a response is observed? One of the children in the sample, Douglas, gave an 
incorrect response to his first S item. This could happen regardless of Douglas’ true stage; 
the probabilities are obtained by subtracting the entries in the S row of Table 2 from 1.000, 
yielding, for Stages 0 through IV, .667, .973, .116, .019, and .057 respectively. This is 
the likelihood function for η induced by the observation of the response. The bulk of the 
evidence is for Stages 0 and I. Combining these values with the initial stage probabilities 
p(η) via Bayes theorem yields updated stage probabilities, p(η | incorrect response to an S 
item): for Stages 0 through IV respectively, .41, .52, .04, .01, and .01. Expectations for 
items not yet administered also change. They are averages of the probabilities of correct 
response expected from the various stages, now weighted by the new stage membership 
probabilities. The state of knowledge after observing Douglas’ first response is depicted in 
Figure 6 (see Appendix for details; also see Macready & Dayton, 1989). 
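
The update for Douglas’ first response can be reproduced with a few lines of Python. This is a sketch under the values quoted in the text: the likelihood vector is entered exactly as reported above rather than recomputed from the rounded entries of Table 2, and the function name `bayes_update` is ours.

```python
def bayes_update(prior, likelihood):
    """Posterior stage probabilities: prior times likelihood, renormalized."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]


# Initial stage probabilities for Stages 0 through IV.
prior = [0.257, 0.227, 0.163, 0.275, 0.078]

# Probability of an incorrect response to an S item at each stage,
# as quoted in the text.
lik_incorrect_s = [0.667, 0.973, 0.116, 0.019, 0.057]

posterior = bayes_update(prior, lik_incorrect_s)
rounded = [round(p, 2) for p in posterior]
print(rounded)  # [0.41, 0.52, 0.04, 0.01, 0.01]

# Updated expectation of a correct response to an E item (E row of Table 2),
# now weighted by the posterior rather than the prior.
e_row = [0.333, 0.973, 0.883, 0.981, 0.943]
expected_e = sum(p * w for p, w in zip(e_row, posterior))
```

The rounded posterior agrees with the values reported in the text; the last two lines show how the expectation for an item not yet administered is re-weighted by the new stage probabilities.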

[Figure 6 about here] 

In a simulation of adaptive testing, we updated our knowledge about Douglas one 
response at a time, at each step looking at his actual response to an item expected to most 
substantially reduce our uncertainty about his stage membership. Figure 7 charts 
probabilities of stage membership for Douglas after each of the first ten items, showing that 
we quickly converge to Stage 0. 



[Figure 7 about here] 



Extending the paradigm 

The balance beam exemplar illustrates the challenge of inferring states of 
understanding, but it addresses development of only a single key concept. A major thrust 
of our proposal is to characterize interconnections among distinct lines of development. 
This section takes a small step in this direction by discussing a hypothetical extension to the 
exemplar, namely, the ability to carry out the arithmetic operations needed to calculate 
torques. For illustrative purposes, we simply posit a skill to carry these calculations out 
reliably, either possessed by a child or not. Obviously states of understanding could be 
developed in greater detail here. 

Calculating and comparing torques to solve the “conflict” problems characterizes 
Stage IV. But if a child at Stage IV cannot carry out the calculations reliably, his pattern of 
correct and incorrect responses would be hard to distinguish from that of a child in Stage 
III. Although the two children might answer about the same number of items correctly, the 
instruction appropriate for them would differ dramatically. And children at any stage of 
understanding of the balance beam might be able to carry out the computational operations 
in isolation. The goal of the extended system is to infer both balance-beam understanding 
and computational skill. To make the distinctions among states of understanding in this 
extended domain, we introduce two new types of observations: 

1. Items isolating computation, such as “Which is greater, 3 x 4 or 5 x 2?” 

2. Probes for introspection about solutions to conflict items: “How did you get your 
answer?” 

Figure 8 offers one possible structure for this network. Others could be entertained, 
and in practice one would compare the degree to which they accord with observed data. To 
keep the diagram simple, only one balance-beam task each for an S and a CS task are 
illustrated. E and D items would have the same paths as the S task, and CD and CE tasks 
would have the same paths as the CS tasks. Also, the paths from Stage 0, I, and II 
indicators to balance beam tasks are not drawn in. The structure of paths, but not 
necessarily the values, would be the same as those connecting the Stage III indicator to 
those tasks. 

[Figure 8 about here] 

There are three kinds of unobservable variables in the system. The first group 
expresses level of understanding in the balance beam domain. It proves convenient to 
express stage membership in terms of dichotomous indicator variables for each stage, 
because of the special relationship of Stage IV to computational skill. Second is the ability 
to carry out the calculations involved in computing torques. The third concerns the 
integration of balance-beam understanding and calculating proficiency. Specifically, we 
posit an indicator for whether a child both is in Stage IV and possesses the requisite 
computational skills. Other features of the network worth mentioning are as follows. 

1. The probabilities of the pure computation items depend on the unobservable 
computation variable only; they are conditionally independent of level of balance 
beam understanding. 

2. The correctness aspect of an answer has only two possibilities, right or wrong, but 
an explanation can fall into five categories corresponding to levels of 
understanding. A Stage III child might give an explanation consistent with Stages 
0, I, II, or III, but would not give a Stage IV explanation. Theory thus posits that 
the conditional probability of a Stage K response from a Stage J child is zero if 
K>J. Conditional probabilities for K≤J might be estimated from data or based on 
experts’ experience. It may turn out, for example, that the most likely explanation 
for an E task from people at Stage IV would be a Stage II explanation: “It 
balances because both the weights and distances are equal.” 

3. For children in Stages 0 through III, both the right/wrong answers and the “How” 
answers to balance beam tasks depend only on level of understanding. Because 
they do not realize the connection between the problems and the torque calculations, 
their responses to the balance beam tasks are conditionally independent of their 
computational skill, even on items for which that skill is an integral component of 
an expert solution. 

4. For children in Stage IV, right/wrong answers to conflict items depend on the 
understanding/computation integration variable, but “How” answers depend only 
on understanding. A child in Stage IV with low computational skill can thus be 
differentiated from a child in Stage III by his higher probabilities of giving Stage IV 
explanations and incorrect answers to pure computation problems. 
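
The four features listed above can be made concrete in a small sketch. The Python fragment below is illustrative only: apart from the CS entries borrowed from Table 2, every probability in it is hypothetical (marked as such in the comments), as is the uniform prior over (stage, skill) pairs, and it is not the network of Figure 8. It shows how the dependence structure alone lets a Stage IV child without computational skill be told apart from a Stage III child.

```python
from itertools import product

STAGES = ["0", "I", "II", "III", "IV"]

# Right/wrong on a CS (conflict) item. Stages 0-III use the CS row of Table 2.
P_CS_BY_STAGE = {"0": 0.333, "I": 0.026, "II": 0.116, "III": 0.333}
P_CS_INTEGRATED = 0.943      # Stage IV with computational skill (Table 2)
P_CS_NOT_INTEGRATED = 0.333  # HYPOTHETICAL: Stage IV without the skill

# HYPOTHETICAL probability of a correct answer to a pure computation item,
# given whether the child has the computational skill.
P_COMP = {True: 0.95, False: 0.20}

# HYPOTHETICAL P(explanation at level K | child at Stage J).
# Entries are zero whenever K > J, as the theory posits.
P_EXPLAIN = {
    "0":   [1.00, 0.00, 0.00, 0.00, 0.00],
    "I":   [0.20, 0.80, 0.00, 0.00, 0.00],
    "II":  [0.10, 0.20, 0.70, 0.00, 0.00],
    "III": [0.05, 0.10, 0.25, 0.60, 0.00],
    "IV":  [0.05, 0.05, 0.10, 0.20, 0.60],
}


def p_conflict_correct(stage, skill):
    """Below Stage IV the answer ignores skill; at Stage IV it depends on
    the integration indicator (Stage IV and computational skill)."""
    if stage != "IV":
        return P_CS_BY_STAGE[stage]
    return P_CS_INTEGRATED if skill else P_CS_NOT_INTEGRATED


def likelihood(stage, skill, cs_correct, comp_correct, explanation):
    p_cs = p_conflict_correct(stage, skill)
    p_comp = P_COMP[skill]  # computation items ignore balance-beam stage
    return ((p_cs if cs_correct else 1 - p_cs)
            * (p_comp if comp_correct else 1 - p_comp)
            * P_EXPLAIN[stage][STAGES.index(explanation)])


def posterior(cs_correct, comp_correct, explanation):
    # Uniform prior over (stage, skill) pairs: an assumption for illustration.
    states = list(product(STAGES, [True, False]))
    joint = {s: likelihood(*s, cs_correct, comp_correct, explanation)
             for s in states}
    total = sum(joint.values())
    return {s: v / total for s, v in joint.items()}


# A child misses a conflict item and a computation item, yet explains the
# conflict item in Stage IV terms.
post = posterior(cs_correct=False, comp_correct=False, explanation="IV")
most_likely = max(post, key=post.get)
print(most_likely)
```

With these made-up values the pattern of evidence points to Stage IV without computational skill, even though the right/wrong answer to the conflict item alone would look the same as that of a Stage III child.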

Discussion 

The conceptual framework described above holds the promise of extending and 
clarifying standard educational measurement practices in several ways: 

Connections with instruction can be forged more easily than with standard tests, 
because the focus is no longer on how many questions a student can answer, but how they 
answer them. In medical diagnosis, different diseases give rise to similar results in certain 
tests; in education, so too can different approaches lead to similar test scores for students. 
But accounting for the patterns of performance, especially if probing adaptively, can 
pinpoint the areas which need attention to best improve performance. 

Student reports can be provided at varying levels and highlighting different features 
of a student’s status. Of particular importance to the student and the teacher are reports in 
terms of levels or stages of understanding of key concepts, since this is the level at which 
instruction is aimed. For the quality control purposes of administrators, however, one 
could predict a student’s performance on a standard set of tasks in the domain: say, a 
“market basket” of tasks that, ideally, every student should eventually be able to handle. 

Use of different strategies or mental models can be accommodated in an inference 
network. This can take the form of either a single strategy/mental model choice for all tasks 
in a class, as studied by Mislevy and Verhelst (1990), or strategy/model switching from 
one task to another (as in Snow & Lohman, 1984). The nature and the strength of 
inferences one can draw will depend on the potential observational settings. With rich 
information, such as verbal protocols or partial solutions, it may be possible to characterize 
the range of solution methods the student has available and the conditions under which he 
employs them. 

Testing “higher-order thinking” can be accomplished by including unobservable 
nodes for connections among more basic facts or concepts, and observable nodes that 
correspond to tasks for which the relationships of interest are critical. Because such tasks 
might well be open-ended and approachable in a variety of ways, the possibility of 
alternative solution strategies would need to be built into the network. 

Adaptive testing can be carried out among concepts, not just for a single concept. 
IRT applications of adaptive testing are based on the one-to-many relationships that are 
appropriate for determining overall levels of proficiency, but inadequate for understanding 
connections among concepts. The inference network facilitates stepping variously 
throughout a domain, gathering information about critical domains by presenting tasks that 
call for varying combinations of key skills. 

Handling atypical knowledge configurations or observational patterns can be 
accomplished by incorporating nodes analogous to the “Other” disease state in MUNIN or 
the catch-all IRT class in Yamamoto’s (1987) Hybrid model. An “Other” state of 
understanding is a mechanism for capturing observational patterns that do not accord with 
those specifically built into the network. A situation-sensitive student report might be 
generated in an instructional system when such a node becomes prominent, signalling that 
more intelligence than is embodied in the system is needed to figure out what this student is 
doing, and decide what to do about it. 



Conclusion 

Learning can be enhanced by a unified conceptual framework for instruction, 
testing, and reporting, because only in such a framework can coherent feedback loops be 
constructed. This presentation has focused on the educational measurement aspect of a 
system built on this premise. The recent introduction of measurement models built around 
states of understanding, and of inferential techniques to connect such pieces into networks 
that describe domains of school learning, provides a foundation for improved educational 
practice in this manner. 

Appendix 

Equations for the Latent Class Model 



The Model 



Let η = (η_0,...,η_4) denote the stage of understanding of a child, with η_k=1 if he or 
she is in Stage k and 0 if not. Let π = (π_0,...,π_4) denote the population proportions of 
children in these classes; that is, π_k = p(η_k=1). Let x_j represent a response to Task j, 1 if 
correct and 0 if not; j runs from 1 to 24. The conditional probabilities of correct response 
are Prob(x_j=1|η_k=1), or P_jk for short. P denotes the matrix ((P_jk)). A vector of item 
responses, x = (x_1,...,x_24), is assumed to have the following probability conditional on 
stage membership: 

$$p(\mathbf{x} \mid \eta_k = 1) = \prod_{j} P_{jk}^{x_j} (1 - P_{jk})^{1 - x_j} \tag{1}$$

Similar expressions are assumed to hold for subsets of responses as well, regardless of the 
order in which they are observed. 

The marginal probability of a response vector is an average of terms like (1), 
weighted by the population probabilities of stage membership: 

$$p(\mathbf{x}) = \sum_{k=0}^{4} p(\mathbf{x} \mid \eta_k = 1)\, \pi_k \tag{2}$$
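
As a check on the reading of (1) and (2), the two expressions can be written out directly. The Python sketch below assumes the 24-task layout described in the text (four tasks of each of the six types, each task sharing its type’s row of Table 2) and the estimated population proportions; the function names are ours.

```python
ITEM_TYPES = ["E", "D", "S", "CD", "CS", "CE"]

# Table 2: P(correct | stage) by item type; columns are Stages 0, I, II, III, IV.
TABLE2 = {
    "E":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "D":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "S":  [0.333, 0.026, 0.883, 0.981, 0.943],
    "CD": [0.333, 0.973, 0.883, 0.333, 0.943],
    "CS": [0.333, 0.026, 0.116, 0.333, 0.943],
    "CE": [0.333, 0.026, 0.116, 0.333, 0.943],
}

# Population proportions pi_k for Stages 0 through IV.
PI = [0.257, 0.227, 0.163, 0.275, 0.078]

# The matrix P: 24 rows (four tasks of each type), five columns (stages).
P = [TABLE2[t] for t in ITEM_TYPES for _ in range(4)]


def p_x_given_stage(x, P, k):
    """Equation (1): product over items of P_jk^x_j (1 - P_jk)^(1 - x_j).

    x may cover only the first len(x) rows of P, since the same form is
    assumed to hold for subsets of responses.
    """
    prob = 1.0
    for x_j, row in zip(x, P):
        prob *= row[k] if x_j == 1 else 1.0 - row[k]
    return prob


def p_x(x, P, pi):
    """Equation (2): mixture of (1) over stages, weighted by pi_k."""
    return sum(p_x_given_stage(x, P, k) * pi[k] for k in range(len(pi)))


all_correct = [1] * 24
print(p_x_given_stage(all_correct, P, 4))  # Stage IV: 0.943 ** 24
print(p_x(all_correct, P, PI))
```

Because (2) is a proper mixture, the marginal probabilities of all response patterns on any subset of items sum to one, which gives a quick sanity check on an implementation.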



Let X denote the matrix of response vectors of a sample of N respondents. For a generic 
pattern x_ℓ, let n_ℓ be the number of respondents producing this pattern. The probability of 
X as a function of P and π has the form 

$$p(\mathbf{X} \mid P, \pi) = C \prod_{\ell} p(\mathbf{x}_\ell)^{n_\ell} \tag{3}$$

where each p(x_ℓ) is given by (2) and C does not depend on P or π. Once X has been 
observed, (3) can be interpreted as a likelihood function, and maxima may be found with 
respect to P and π. 

Because N is only 120 in the balance beam example, a number of constraints were 
introduced so that stable estimates would be obtained. Many could be relaxed or removed 
with larger samples. The results reported in Table 2 represent the best-fitting result among 
several models with similar numbers of constraints. The P_jk’s that appear as .333 in Table 
1 were fixed at that value. All four items of a given type were constrained to have the same 
P_jk’s. For a given column, all P_jk’s in cells that correspond to 1’s in Table 1 were 
constrained to be equal to a single estimated value. Any cells in that column that 
correspond to 0’s were constrained to its complement. 
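
The constraints just described amount to building the whole table from the pattern of Table 1 and one free value per column. The following Python sketch illustrates that construction; it is not the estimation procedure itself, and the encoding of the pattern as strings and the name `constrained_table` are ours. The column values are those printed for the 1-cells of Table 2.

```python
# Pattern of Table 1. "1" = theoretically correct, "0" = theoretically
# incorrect, "g" = fixed at .333. Columns are Stages 0, I, II, III, IV.
PATTERN = {
    "E":  "g1111",
    "D":  "g1111",
    "S":  "g0111",
    "CD": "g11g1",
    "CS": "g00g1",
    "CE": "g00g1",
}
FIXED = 0.333


def constrained_table(pattern, column_values):
    """Fill each cell from the single free value of its column."""
    table = {}
    for item_type, cells in pattern.items():
        row = []
        for k, cell in enumerate(cells):
            if cell == "g":
                row.append(FIXED)                     # fixed, not estimated
            elif cell == "1":
                row.append(column_values[k])          # shared column value
            else:
                row.append(1.0 - column_values[k])    # its complement
        table[item_type] = row
    return table


# One free value per column; the Stage 0 column has none (all cells fixed).
column_values = [None, 0.973, 0.883, 0.981, 0.943]
table = constrained_table(PATTERN, column_values)

for item_type, row in table.items():
    print(item_type, [round(p, 3) for p in row])
```

The complements computed this way (.027 and .117) differ in the third decimal from the .026 and .116 printed in Table 2, presumably because the reported values were rounded or truncated from more precise estimates.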



Adaptive Testing 

The maximum likelihood estimates of P and π were treated as known true 
parameter values during simulated adaptive testing. The uncertainty in these values could 
be taken into account, but we have avoided the complication for this demonstration. 

Before observing any responses from a given child, the expected value of his η is 
the population value π. The expected value of a response to a particular item j is obtained 
analogously to (2), simplified to a single, as yet unobserved, response: 

$$p(x_j = 1) = \sum_{k} p(x_j = 1 \mid \eta_k = 1)\, p(\eta_k = 1) = \sum_{k} P_{jk}\, \pi_k \tag{4}$$



Suppose that Item g is administered to a particular examinee, and the value of x_g, 
either 0 or 1, becomes known. How is this information propagated through the network? 
First, using Bayes theorem, we update probabilities for his η. For k=0,...,4, 

$$p(\eta_k = 1 \mid x_g) = \frac{p(x_g \mid \eta_k = 1)\, p(\eta_k = 1)}{\sum_{h} p(x_g \mid \eta_h = 1)\, p(\eta_h = 1)} \tag{5}$$

This gives new probabilities that the examinee is in each of the possible stages. These are 
in turn reflected in new expectations for items not yet administered by replacing p(η_k=1) in 
(4) with p(η_k=1|x_g) to obtain 

$$p(x_j = 1 \mid x_g) = \sum_{k} p(x_j = 1 \mid \eta_k = 1)\, p(\eta_k = 1 \mid x_g) \tag{6}$$

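This process can be repeated with additional items presented one at a time. Let x_s 
represent a partial response sequence (the responses to the first s items administered); 
Item s+1 is next administered to form x_{s+1}. Writing sequences in boldface and the single 
response to Item s+1 as plain x_{s+1}, 

$$p(\eta_k = 1 \mid \mathbf{x}_{s+1}) = \frac{p(x_{s+1} \mid \eta_k = 1)\, p(\eta_k = 1 \mid \mathbf{x}_s)}{\sum_{h} p(x_{s+1} \mid \eta_h = 1)\, p(\eta_h = 1 \mid \mathbf{x}_s)} \tag{7}$$

and, for items not yet presented, 

$$p(x_j = 1 \mid \mathbf{x}_{s+1}) = \sum_{k} p(x_j = 1 \mid \eta_k = 1)\, p(\eta_k = 1 \mid \mathbf{x}_{s+1}) \tag{8}$$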
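
The recursion in (7) and (8) can be sketched as a short loop. In the Python fragment below the conditional probabilities are those of Table 2 and the starting point is the estimated population distribution; the response sequence is hypothetical and is not Douglas’ actual record beyond his first S item. The function names are ours.

```python
# Table 2: P(correct | stage) by item type; columns are Stages 0, I, II, III, IV.
TABLE2 = {
    "E":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "D":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "S":  [0.333, 0.026, 0.883, 0.981, 0.943],
    "CD": [0.333, 0.973, 0.883, 0.333, 0.943],
    "CS": [0.333, 0.026, 0.116, 0.333, 0.943],
    "CE": [0.333, 0.026, 0.116, 0.333, 0.943],
}
PRIOR = [0.257, 0.227, 0.163, 0.275, 0.078]


def update(stage_probs, item_type, correct):
    """Equation (7): fold one more response into the stage probabilities."""
    row = TABLE2[item_type]
    lik = [p if correct else 1.0 - p for p in row]
    joint = [l * q for l, q in zip(lik, stage_probs)]
    total = sum(joint)
    return [j / total for j in joint]


def predict(stage_probs, item_type):
    """Equation (8): expected probability of a correct response to an
    item not yet presented."""
    return sum(p * q for p, q in zip(TABLE2[item_type], stage_probs))


# HYPOTHETICAL response sequence: (item type, 1 = correct / 0 = incorrect).
responses = [("S", 0), ("E", 0), ("CD", 0)]

probs = PRIOR
for item_type, x in responses:
    probs = update(probs, item_type, x == 1)
    print(item_type, x, [round(p, 3) for p in probs],
          "next CS expected:", round(predict(probs, "CS"), 3))
```

Because each step multiplies by a likelihood and renormalizes, the final probabilities do not depend on the order in which the responses are processed, in line with the remark following (1).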

Selecting which item to present next and deciding when to stop depends on 
probabilities for η. In this paper we have addressed only the case in which no 
decision-making cost structure is available, and we address only the goal of minimizing 
uncertainty about η. This can be accomplished by minimum entropy adaptive testing. 
Entropy is a measure of randomness. For the five-class balance beam problem, the maximal value of 
entropy occurs when probabilities of all five classes are equal, and the minimal value 
occurs when the probability of one particular stage is one. The general formula for entropy 
after having observed x_s is 

$$E(\mathbf{x}_s) = -\sum_{k} p(\eta_k = 1 \mid \mathbf{x}_s) \log[p(\eta_k = 1 \mid \mathbf{x}_s)] \tag{9}$$

After having observed x_s, one can evaluate the expected entropy associated with the 
administration of any remaining item j as 

$$E[\mathbf{x}_s \cap (x_j = 0)]\, p(x_j = 0 \mid \mathbf{x}_s) + E[\mathbf{x}_s \cap (x_j = 1)]\, p(x_j = 1 \mid \mathbf{x}_s) \tag{10}$$

The item that minimizes (10) is presented next. 
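
The selection rule can be sketched as follows. This Python fragment is a minimal illustration of (9) and (10) using the Table 2 values, with item types standing in for individual items (all four items of a type share the same probabilities); it is not the program used for the simulation reported in Figure 7, and the function names are ours.

```python
import math

# Table 2: P(correct | stage) by item type; columns are Stages 0, I, II, III, IV.
TABLE2 = {
    "E":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "D":  [0.333, 0.973, 0.883, 0.981, 0.943],
    "S":  [0.333, 0.026, 0.883, 0.981, 0.943],
    "CD": [0.333, 0.973, 0.883, 0.333, 0.943],
    "CS": [0.333, 0.026, 0.116, 0.333, 0.943],
    "CE": [0.333, 0.026, 0.116, 0.333, 0.943],
}
PRIOR = [0.257, 0.227, 0.163, 0.275, 0.078]


def entropy(probs):
    """Equation (9); terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def update(stage_probs, item_type, correct):
    """Bayes update of the stage probabilities for one response."""
    lik = [p if correct else 1.0 - p for p in TABLE2[item_type]]
    joint = [l * q for l, q in zip(lik, stage_probs)]
    total = sum(joint)
    return [j / total for j in joint]


def predict(stage_probs, item_type):
    """Current probability of a correct response to an item of this type."""
    return sum(p * q for p, q in zip(TABLE2[item_type], stage_probs))


def expected_entropy(stage_probs, item_type):
    """Equation (10): entropy after each possible response, weighted by
    the current probability of that response."""
    p1 = predict(stage_probs, item_type)
    h1 = entropy(update(stage_probs, item_type, True))
    h0 = entropy(update(stage_probs, item_type, False))
    return h0 * (1.0 - p1) + h1 * p1


def next_item(stage_probs, remaining_types):
    """Pick the item type with the smallest expected entropy."""
    return min(remaining_types, key=lambda t: expected_entropy(stage_probs, t))


best = next_item(PRIOR, list(TABLE2))
print("current entropy:", round(entropy(PRIOR), 3))
for t in TABLE2:
    print(t, round(expected_entropy(PRIOR, t), 3))
print("administer next:", best)
```

Expected entropy after any item can never exceed the current entropy, so the rule always selects an item that is expected to leave uncertainty about η no greater than before; ties (such as between E and D items, whose rows are identical) are broken here by list order.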

It bears repeating that these formulae assume both that the model is correct and that the 
conditional probabilities are known with certainty. Violations of these assumptions 
generally degrade knowledge about an examinee’s state, making (5) and (8) in particular 
overly optimistic. Work remains to be done in studying the robustness of the approach to 
violations of the assumptions, learning how to minimize violations in practice, and 
modifying the model or the conditional probabilities to mitigate inferential errors in the 
presence of violations. 

TABLE 1 



Theoretical Conditional Probabilities: 
Expected Proportions of Correct Response 

Problem type   Stage 0   Stage I   Stage II   Stage III   Stage IV 
E               .333      1.000     1.000      1.000       1.000 
D               .333      1.000     1.000      1.000       1.000 
S               .333       .000     1.000      1.000       1.000 
CD              .333      1.000     1.000       .333       1.000 
CS              .333       .000      .000       .333       1.000 
CE              .333       .000      .000       .333       1.000 



TABLE 2 

Estimated Conditional Probabilities: 
Expected Proportions of Correct Response 

Problem type   Stage 0   Stage I   Stage II   Stage III   Stage IV 
E               .333*     .973      .883       .981        .943 
D               .333*     .973      .883       .981        .943 
S               .333*     .026      .883       .981        .943 
CD              .333*     .973      .883       .333*       .943 
CS              .333*     .026      .116       .333*       .943 
CE              .333*     .026      .116       .333*       .943 

* denotes fixed value 

FIGURE 1 

The MUNIN Network: Initial Status 

(From Andreassen et al., 1987) 

FIGURE 2 

The MUNIN Network: After Selected Observations 

(From Andreassen et al., 1987) 

[Task prompt shown with the figure: “When the blocks are removed, will the 
beam tip left, tip right, or stay flat?”] 

Figure 3 

A Sample Balance-Beam Task 

Item Type   Sample Item Description 

E           Equal problems (E), with matching weights and lengths on 
            both sides. 

D           Dominant problems (D), with unequal weights but equal 
            lengths. 

S           Subordinate problems (S), with unequal lengths but equal 
            weights. 

CD          Conflict-dominant problems (CD), in which one side has greater 
            weight, the other has greater length, and the side with the 
            heavier weight will go down. 

CS          Conflict-subordinate problems (CS), in which one side has greater 
            weight, the other has greater length, and the side with the 
            greater length will go down. 

CE          Conflict-equal problems (CE), in which one side has greater 
            weight, the other has greater length, and the beam will balance. 

Figure 4 

Sample Balance Beam Items 

[Graphic not reproduced; label: Problem Type] 

Figure 5 

Initial State in an Inference Network 
for the Balance Beam Example 

[Graphic not reproduced; label: Problem Type] 

Figure 6 

State of Knowledge about Cognitive Level 
after an Incorrect Response to an S Item 

Figure 7 

Posterior Probabilities of Cognitive Levels 

Figure 8 

Representation of an Extended Balance-Beam Network 
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